Abstract: Consider a multi-phase project management problem where the 
decision maker needs to deal with two issues: (a) how to allocate resources to 
projects within each phase, and (b) when to enter the next phase, so that the 
total expected reward is as large as possible. We formulate the problem as a 
multi-armed bandit problem with precedence relations. In Chan, Fuh and Hu 
(2005), a class of asymptotically optimal arm-pulling strategies is constructed 
to minimize the shortfall from perfect information payoff. Here we further 
explore optimality properties of the proposed strategies. First, we show that 
the efficiency benchmark, which is given by the regret lower bound, reduces to 
those in Lai and Robbins (1985), Hu and Wei (1989), and Fuh and Hu (2000). 
This implies that the proposed strategy is also optimal under the settings of 
aforementioned papers. Secondly, we establish the super-efficiency of the proposed 
strategies when the bad set is empty. Thirdly, we show that they remain 
optimal under a constant switching cost between arms. In addition, we prove that 
Wald's equation holds for Markov chains under the Harris recurrence condition, 
which is an important tool in studying the efficiency of the proposed strategies. 



1. Introduction 

Suppose there are $U = J_1 + \cdots + J_I$ statistical populations, 
$\Pi_{11}, \Pi_{12}, \ldots, \Pi_{IJ_I}$. Pulling arm $ij$ once corresponds to taking an 
observation from population $\Pi_{ij}$. The observations from $\Pi_{ij}$ form a Markov 
chain on a state space $D$ with transition probability density function 
$p_{ij}(x, y; \theta)$ with respect to a $\sigma$-finite measure $Q$, where $\theta$ is an unknown 
parameter belonging to a parameter space $\Theta$. The stationary probability 
distribution for the Markov chain exists and has probability density function 
$\pi_{ij}(\cdot; \theta)$. 

At each step, we are required to sample one of the statistical populations obeying 
the partial order $ij \preceq i'j' \Leftrightarrow i \le i'$. An adaptive policy is a sampling rule that 
dictates, at each step, which population should be sampled based on observations 
before that step. We can represent a policy as a sequence of random variables 
$\phi = \{\phi_t \mid \phi_{t-1} \preceq \phi_t,\ t = 1, 2, \ldots\}$ taking values in 
$\{ij \mid i = 1, \ldots, I;\ j = 1, \ldots, J_i\}$ such that the event $\{\phi_t = ij\}$, 'take an 
observation from $\Pi_{ij}$ at step $t$', belongs to the $\sigma$-field generated by 
$\phi_1, X_1, \ldots, \phi_{t-1}, X_{t-1}$, where $X_t$ denotes the state of the population being 
sampled at the $t$-th step. 
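
As an illustration of the precedence constraint, the following minimal sketch checks whether a finite sequence of pulls is admissible, that is, whether the group index never decreases. The representation of an arm as a pair (i, j) and the function name `respects_precedence` are assumptions made for this example only.

```python
from typing import Sequence, Tuple

Arm = Tuple[int, int]  # hypothetical encoding: (group i, arm j within group i)


def respects_precedence(pulls: Sequence[Arm]) -> bool:
    """Return True if the group index never decreases along the sequence.

    Switching between arms of the same group is allowed; returning to an
    earlier group is not.
    """
    return all(prev[0] <= cur[0] for prev, cur in zip(pulls, pulls[1:]))


# Moving back and forth inside group 1 is fine; so is moving on to groups 2, 3.
ok = respects_precedence([(1, 1), (1, 2), (1, 1), (2, 1), (3, 2)])
# Returning to group 1 after sampling group 2 violates the partial order.
bad = respects_precedence([(1, 1), (2, 1), (1, 2)])
```

The check only encodes the ordering of groups; it says nothing about how a policy chooses the next arm from past observations.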
Let the initial state of $\Pi_{ij}$ be distributed according to $\nu_{ij}(\cdot; \theta)$. Throughout this 
paper, we shall use the notation $E_\theta$ ($P_\theta$) to denote expectation (probability) with 
respect to the initial distribution $\nu_{ij}(\cdot; \theta)$; similarly, $E_{\pi(\theta)}$ denotes expectation 
with respect to the stationary distribution $\pi_{ij}(\cdot; \theta)$. We shall assume that 
$\mathcal{V}_{ij} = \{x \in D : \nu_{ij}(x; \theta) > 0\}$ does not depend on $\theta$ and 
$v_{ij} := \inf_{x \in \mathcal{V}_{ij}} \inf_{\theta, \theta' \in \Theta} [\nu_{ij}(x; \theta)/\nu_{ij}(x; \theta')] > 0$ for all $i, j$. Suppose that 
$\int_{x \in D} |g(x)| \pi_{ij}(x; \theta) Q(dx) < \infty$. Let 

$\mu_{ij}(\theta) = \int_{x \in D} g(x) \pi_{ij}(x; \theta) Q(dx)$ 

be the mean reward under the stationary distribution $\pi_{ij}$ when $\Pi_{ij}$ is sampled once. 
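Let $N$ be the total sample size from all populations, and 

(1.1)  $T_N(ij) = \sum_{t=1}^{N} \mathbf{1}\{\phi_t = ij\}$ 

be the sample size from $\Pi_{ij}$, where $\mathbf{1}$ denotes the indicator function. It follows that 
the total reward equals 

(1.2)  $W_N(\theta) := \sum_{t=1}^{N} \sum_{i=1}^{I} \sum_{j=1}^{J_i} E_\theta\big[ g(X_t) \mathbf{1}\{\phi_t = ij\} \big]$. 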

In the case of independent rewards, that is, when $p_{ij}(x, y; \theta) = p_{ij}(y; \theta)$ for all 
$i, j, x, y$ and $\theta$, $W_N(\theta) = \sum_{i=1}^{I} \sum_{j=1}^{J_i} \mu_{ij}(\theta) E_\theta T_N(ij)$. We shall show in the 
Appendix that for Markovian rewards, under regularity conditions A3-A4 (see Section 
2.1), there exists a constant $C_0 < \infty$ independent of $\theta \in \Theta$, $N > 0$ and the strategy 
$\phi$ such that 

(1.3)  $\Big| W_N(\theta) - \sum_{i=1}^{I} \sum_{j=1}^{J_i} \mu_{ij}(\theta) E_\theta T_N(ij) \Big| \le C_0$. 

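In light of (1.3), maximizing $W_N(\theta)$ is asymptotically equivalent [up to an $O(1)$ 
term] to minimizing the regret 

(1.4)  $R_N(\theta) := N\mu^*(\theta) - W_N(\theta) = \sum_{i=1}^{I} \sum_{j=1}^{J_i} [\mu^*(\theta) - \mu_{ij}(\theta)] E_\theta T_N(ij) + O(1)$, 

where $\mu^*(\theta) := \max_{1 \le i \le I} \max_{1 \le j \le J_i} \mu_{ij}(\theta)$. 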

Because adaptive strategies $\phi$ that are optimal for all $\theta \in \Theta$ and large $N$ in 
general do not exist, we consider the class of all (asymptotically) uniformly good 
adaptive strategies under the partial order constraint $\preceq$, satisfying 

(1.5)  $R_N(\theta) = o(N^\alpha)$, for all $\alpha > 0$ and $\theta \in \Theta$. 

Such strategies have regret that does not increase too rapidly for any $\theta \in \Theta$. We 
would like to find a strategy that minimizes the rate of increase of the regret within 
the class of uniformly good adaptive strategies under the partial order constraint $\preceq$. 
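
To make the leading term of the regret in (1.4) concrete, here is a small sketch that evaluates $\sum_{i,j} [\mu^*(\theta) - \mu_{ij}(\theta)] E_\theta T_N(ij)$ from a table of mean rewards and a table of expected sample sizes. Both tables are made-up numbers chosen for illustration, and the function name `regret` is not from the paper.

```python
def regret(mu, expected_pulls):
    """Leading term of the regret in (1.4).

    mu[i][j]             stationary mean reward of arm ij
    expected_pulls[i][j] expected sample size E T_N(ij)
    """
    mu_star = max(max(row) for row in mu)
    return sum(
        (mu_star - m) * n
        for mu_row, n_row in zip(mu, expected_pulls)
        for m, n in zip(mu_row, n_row)
    )


# Two groups with two arms each (illustrative values, N = 1000 pulls in total).
mu = [[0.25, 0.5], [0.75, 0.5]]
pulls = [[8.0, 12.0], [970.0, 10.0]]
r = regret(mu, pulls)  # 0.5*8 + 0.25*12 + 0*970 + 0.25*10
```

Only pulls of arms with $\mu_{ij}(\theta) < \mu^*(\theta)$ contribute, which is why a good strategy keeps the expected number of such pulls small.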

The rest of the article is organized as follows. In Section 2, we present the 
assumptions and introduce the concept of bad sets. The regret lower bound is investigated 
in Section 3. We also prove that the regret lower bound specializes to other lower 
bounds obtained by previous authors under less general settings. Section 4 
contains the super-efficiency result when the bad sets are empty. The optimality of the 
proposed strategies under constant switching cost is investigated in Section 5. The 
last section includes the proof of Wald's equation for Markov random walks under 
the Harris recurrence condition. 
2. The assumptions and bad sets 

Denote the Kullback-Leibler information number by 

(2.1)  $I_{ij}(\theta, \theta') = \int_{x \in D} \int_{y \in D} \log\left[ \frac{p_{ij}(x, y; \theta)}{p_{ij}(x, y; \theta')} \right] p_{ij}(x, y; \theta) \pi_{ij}(x; \theta) Q(dy) Q(dx)$. 

Then $0 \le I_{ij}(\theta, \theta') \le \infty$. We shall assume that $I_{ij}(\theta, \theta') < \infty$ for all $i, j$ and 
$\theta, \theta' \in \Theta$. Let $\mu_i(\theta) = \max_{1 \le j \le J_i} \mu_{ij}(\theta)$ be the largest reward in the $i$-th group of 
arms, and 

(2.2)  $\Theta_i = \{\theta \in \Theta : \mu_i(\theta) > \mu_{i'}(\theta) \text{ for all } i' < i \text{ and } \mu_i(\theta) \ge \mu_{i'}(\theta) \text{ for all } i' > i\}$ 

be the set of parameter values such that the first optimal job is in group $i$. Let 

(2.3)  $\Theta_{ij} = \{\theta \in \Theta_i : \mu_{ij}(\theta) = \mu_i(\theta)\}$ 

be the parameter set such that arm $ij$ is one of the first optimal ones. Each $\theta \in \Theta$ 
belongs to exactly one $\Theta_i$ but may belong to more than one $\Theta_{ij}$. Let 

(2.4)  $\Theta_i^* = \{\theta \in \Theta : \mu_i(\theta) > \mu_{i'}(\theta) \text{ for all } i' \ne i\}$ 

be the parameter set in which all the optimal arms lie in group $i$. Clearly, $\Theta_i^* \subset \Theta_i$ 
but the reverse relation is not necessarily true. 
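
The definitions (2.2)-(2.4) can be read off a table of mean rewards. The sketch below, which assumes the means $\mu_{ij}(\theta)$ are given as a list of rows (one row per group), returns the index $i$ with $\theta \in \Theta_i$, the arms $ij$ with $\theta \in \Theta_{ij}$, and whether $\theta \in \Theta_i^*$. The function names are hypothetical.

```python
def first_optimal_group(mu):
    """1-based index i such that theta lies in Theta_i, see (2.2):
    the earliest group whose best arm attains the overall maximum."""
    group_best = [max(row) for row in mu]
    return 1 + group_best.index(max(group_best))


def first_optimal_arms(mu):
    """All arms ij with theta in Theta_ij, see (2.3)."""
    i = first_optimal_group(mu)
    best = max(mu[i - 1])
    return [(i, j + 1) for j, m in enumerate(mu[i - 1]) if m == best]


def in_theta_star(mu):
    """True if theta lies in Theta_i^*, see (2.4): exactly one group
    attains the overall maximum."""
    group_best = [max(row) for row in mu]
    return group_best.count(max(group_best)) == 1


# Groups 1 and 2 both attain the maximum 0.5, so theta is in Theta_1
# but not in Theta_1^*.
mu = [[0.3, 0.5], [0.5, 0.5], [0.4]]
group = first_optimal_group(mu)
arms = first_optimal_arms(mu)
strict = in_theta_star(mu)
```

The example shows a parameter on $\Theta_1 \setminus \Theta_1^*$: the first optimal arm is in group 1, yet optimal arms also exist in group 2.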



2.1. The assumptions 

We now state a set of assumptions that will be used to prove the optimality results. 

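Let $\Theta$ be a compact subset of $\mathbb{R}^d$ for some $d \ge 1$. 

A1. $\mu_{ij}(\cdot)$ are finite and continuous on $\Theta$ for all $i, j$. Moreover, no arm group is 
redundant in the sense that $\Theta_i^* \ne \emptyset$ for all $i = 1, \ldots, I$. 

A2. $\sum_{j=1}^{J_1} I_{1j}(\theta, \theta') > 0$ for all $\theta' \ne \theta$ and $\inf_{\theta' \in \Theta_i} I_{ij}(\theta, \theta') > 0$ for all 
$1 \le i < I$, $1 \le j \le J_i$ and $\theta \in \cup_{\ell > i} \Theta_\ell$. 

A3. For each $j = 1, \ldots, J_i$, $i = 1, \ldots, I$ and $\theta \in \Theta$, $\{X_{ijt}, t \ge 0\}$ is a Markov 
chain on a state space $D$ with $\sigma$-algebra $\mathcal{D}$, irreducible with respect to a 
maximal irreducible measure on $(D, \mathcal{D})$ and aperiodic. Furthermore, $X_{ijt}$ is 
Harris recurrent in the sense that there exist a set $G_{ij} \in \mathcal{D}$, $\alpha_{ij} > 0$ and a 
probability measure $\varphi_{ij}$ on $G_{ij}$ such that $P^\theta_{ij}\{X_{ijt} \in G_{ij} \text{ i.o.} \mid X_{ij0} = x\} = 1$ 
for all $x \in D$ and 

(2.5)  $P^\theta_{ij}\{X_{ij1} \in A \mid X_{ij0} = x\} \ge \alpha_{ij} \varphi_{ij}(A)$ for all $x \in G_{ij}$ and $A \in \mathcal{D}$. 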

A4. There exist constants $0 < \beta < 1$, $b > 0$ and drift functions $V_{ij} : D \to [1, \infty)$ 
such that for all $j = 1, \ldots, J_i$ and $i = 1, \ldots, I$, 

(2.6)  $\sup_{x \in D} |g(x)| / V_{ij}(x) < \infty$, 

and for all $x \in D$, $\theta \in \Theta$, 

(2.7)  $P^\theta_{ij} V_{ij}(x) \le (1 - \beta) V_{ij}(x) + b \mathbf{1}_{G_{ij}}(x)$, 

where $G_{ij}$ satisfies (2.5) and $P^\theta_{ij} V_{ij}(x) = \int_D V_{ij}(y) P^\theta_{ij}(x, dy)$. Moreover, we 
require that 

(2.8)  $\int_D V_{ij}(x) \nu_{ij}(x; \theta) Q(dx) < \infty$ and $V^*_{ij} := \sup_{x \in G_{ij}} V_{ij}(x) < \infty$. 

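Let $\ell_{ij}(x, y; \theta, \theta') = \log[p_{ij}(x, y; \theta) / p_{ij}(x, y; \theta')]$ be the log likelihood ratio 
between $P^\theta_{ij}$ and $P^{\theta'}_{ij}$ and $N_\delta(\theta) = \{\theta' : \|\theta - \theta'\| < \delta\}$ a ball of radius $\delta$ around 
$\theta$, where $\|\cdot\|$ denotes the Euclidean norm. 

A5. There exists $\delta > 0$ such that for all $\theta, \theta' \in \Theta$, 

(2.9)  $K_{\theta,\theta'} := \sup_{x \in D} \frac{E_\theta\big[ \sup_{\tilde\theta \in N_\delta(\theta')} \ell^2_{ij}(X_{ij0}, X_{ij1}; \tilde\theta, \theta) \mid X_{ij0} = x \big]}{V_{ij}(x)} < \infty$ 

for all $j = 1, \ldots, J_i$, $i = 1, \ldots, I$. Moreover, 

(2.10)  $\sup_{\tilde\theta \in N_\delta(\theta')} |\ell_{ij}(x, y; \tilde\theta, \theta')| \to 0$ as $\delta \to 0$ 

for all $x, y \in D$ and $\theta' \in \Theta$. 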

Assumption A1 is a mild regularity condition to exclude unrealistic models. A2 
is a positive information criterion: the first inequality ensures that information 
is available in the first arm group to estimate $\theta$, while the second inequality allows 
us to collect information in the $i$-th arm group for moving to the next group when 
$\theta \in \Theta_\ell$ for some $\ell > i$. Assumption A3 is a recurrence condition and A4 is a drift 
condition. These two conditions are used to guarantee the stability of the Markov 
chain so that the strong law of large numbers and Wald's equation hold. A5 is a 
finite second moment condition that allows us to bound the probability that the 
MLE of $\theta$ lies outside a small neighborhood of $\theta$. This bound is important for 
determining the level of unequal allocation of observations that can be permitted 
in the testing stage of our procedure. The proof of the asymptotic lower bound in 
Theorem 1 requires only A1-A3, while A4 and A5 are additionally required for the 
construction of efficient strategies attaining the lower bound. 



2.2. Bad sets 

The bad set is a useful concept for understanding the learning required within the 
group containing optimal arms. It is associated with the asymptotic lower bound 
described in Section 3 and is used explicitly in constructing the asymptotically 
efficient strategy. For $\theta \in \Theta_\ell$, define $J(\theta) = \{j : \mu^*(\theta) = \mu_{\ell j}(\theta)\}$ as the set of 
optimal jobs in group $\ell$. Hence $\theta \in \Theta_{\ell j}$ if and only if $j \in J(\theta)$. We also define 
the bad set, the set of 'bad' parameter values associated with $\theta$, as all $\theta' \in \Theta_\ell$ 
which cannot be distinguished from $\theta$ by processing any of the optimal jobs $\ell j$. 
Specifically, 

(2.11)  $B_\ell(\theta) = \big\{\theta' \in \Theta_\ell \setminus \big( \bigcup_{j \in J(\theta)} \Theta_{\ell j} \big) : I_{\ell j}(\theta, \theta') = 0 \text{ for all } j \in J(\theta) \big\}$. 

The bad set $B_\ell(\theta)$ is the intersection of two parameter sets. One set consists of 
parameter values that have different optimal arms from those for $\theta$. The other set 
contains parameter values that cannot be distinguished from $\theta$ by sampling the optimal 
arms for $\theta$. When a parameter value is in the intersection, sampling from arms that 
are non-optimal for $\theta$ is required. 

We note that if $I_{ij}(\theta, \theta') = 0$, then the transition probabilities of $X_{ijt}$ are 
identical under both $\theta$ and $\theta'$. If $\theta' \in B_\ell(\theta)$, then by definition, $\theta' \notin \cup_{j \in J(\theta)} \Theta_{\ell j}$ and hence 
$J(\theta') \cap J(\theta) = \emptyset$. Let $j \in J(\theta)$ and $j' \in J(\theta')$. Then 
$\mu_{\ell j'}(\theta') > \mu_{\ell j}(\theta') = \mu_{\ell j}(\theta) \ge \mu_{\ell j'}(\theta)$. Thus 

(2.12)  $I_{\ell j'}(\theta, \theta') > 0$ for all $\theta' \in B_\ell(\theta)$ and $j' \in J(\theta')$. 

The interpretation of (2.12) is as follows. Although we cannot distinguish $\theta$ from 
$\theta' \in B_\ell(\theta)$ when sampling the optimal arm for $\theta$, we can distinguish them by 
sampling the optimal job for $\theta'$. This fact explains the necessity of processing 
non-optimal arms to collect information. 
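
A toy computation may help fix the definition (2.11). The sketch below makes several simplifying assumptions that are not in the paper: there is a single group, rewards are independent Bernoulli variables, the parameter is the vector of arm means, and the parameter space is a short finite list. Under these assumptions $I_j(\theta, \theta')$ is the Bernoulli Kullback-Leibler divergence between the $j$-th coordinates.

```python
import math


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def optimal_set(theta):
    """J(theta): indices of the arms with the largest mean."""
    best = max(theta)
    return {j for j, m in enumerate(theta) if m == best}


def bad_set(theta, candidates):
    """Finite-grid analogue of (2.11): parameter values sharing no optimal
    arm with theta and indistinguishable from theta on every arm in J(theta)."""
    J = optimal_set(theta)
    return [
        tp for tp in candidates
        if not (optimal_set(tp) & J)
        and all(kl_bernoulli(theta[j], tp[j]) == 0.0 for j in J)
    ]


theta = (0.6, 0.4)  # arm 0 is optimal
grid = [(0.6, 0.4), (0.6, 0.8), (0.5, 0.8), (0.6, 0.6)]
B = bad_set(theta, grid)
```

Here only $(0.6, 0.8)$ is bad: arm 0 behaves identically under both parameter values, so the better arm 1 can only be discovered by sampling arm 1, which is non-optimal under $\theta$.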



3. The regret lower bound 

The following theorem gives an asymptotic lower bound for the regret (1.4) of 
uniformly good adaptive strategies under the partial order constraint $\preceq$. The proof 
can be found in [1]. We will discuss the relation of the lower bound with those in 
[7], [6] and [3]. 

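Theorem 1. Assume A1-A3 and let $\theta \in \Theta_\ell$. For any uniformly good adaptive 
strategy $\phi$ under the partial order constraint $\preceq$, 

(3.1)  $\liminf_{N \to \infty} R_N(\theta) / \log N \ge z(\theta, \ell)$, 

where $z(\theta, \ell)$ is the minimum value of the following minimization problem. 

(3.2)  Minimize $\sum_{i < \ell} \sum_{j=1}^{J_i} [\mu^*(\theta) - \mu_{ij}(\theta)] z_{ij}(\theta) + \sum_{j \notin J(\theta)} [\mu^*(\theta) - \mu_{\ell j}(\theta)] z_{\ell j}(\theta)$, 

subject to $z_{ij}(\theta) \ge 0$, $j = 1, \ldots, J_i$, if $i < \ell$; $j \notin J(\theta)$, if $i = \ell$, and 

(3.3) 
$$\begin{cases}
\inf_{\theta' \in \Theta_1} \big\{ \sum_{j=1}^{J_1} I_{1j}(\theta, \theta') z_{1j}(\theta) \big\} \ge 1, \\
\inf_{\theta' \in \Theta_2} \big\{ \sum_{j=1}^{J_1} I_{1j}(\theta, \theta') z_{1j}(\theta) + \sum_{j=1}^{J_2} I_{2j}(\theta, \theta') z_{2j}(\theta) \big\} \ge 1, \\
\qquad \vdots \\
\inf_{\theta' \in \Theta_{\ell-1}} \big\{ \sum_{j=1}^{J_1} I_{1j}(\theta, \theta') z_{1j}(\theta) + \cdots + \sum_{j=1}^{J_{\ell-1}} I_{(\ell-1)j}(\theta, \theta') z_{(\ell-1)j}(\theta) \big\} \ge 1, \\
\inf_{\theta' \in B_\ell(\theta)} \big\{ \sum_{i < \ell} \sum_{j=1}^{J_i} I_{ij}(\theta, \theta') z_{ij}(\theta) + \sum_{j \notin J(\theta)} I_{\ell j}(\theta, \theta') z_{\ell j}(\theta) \big\} \ge 1.
\end{cases}$$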

Corollary 1. When there is only one group of arms, (3.1) reduces to the lower 
bound (1.11) of Lai and Robbins [7]. 

Proof. When there is only one group of arms, only the last inequality of (3.3) is needed 
and it takes the form 

(3.4)  $\inf_{\theta' \in B(\theta)} \sum_{j \notin J(\theta)} I(\theta_j, \theta'_j) z_j(\theta) \ge 1$. 

In [7] it is proved that 

(3.5)  $E_\theta T_N(j) \ge \frac{[1 + o(1)] \log N}{I(\theta_j, \theta^*)}$ for all $j \notin J(\theta)$, 

where $\theta^* = \max_{1 \le j \le k} \theta_j$. Note that in [7], all jobs belong to the same family of 
probability distributions with different parameter values, and thus the KL 
information number does not depend on the job label but only on the parameter value. Let 
$E_\theta T_N(j) / \log N = z_j(\theta)$; then (3.5) is the same as 

(3.6)  $z_j(\theta) I(\theta_j, \theta^*) \ge 1$ for all $j \notin J(\theta)$. 

We first show that (3.4) $\Rightarrow$ (3.5). Note that (3.4) implies that for all $\theta' \in B(\theta)$, 

(3.7)  $\sum_{j \notin J(\theta)} I(\theta_j, \theta'_j) z_j(\theta) \ge 1$. 

If $\theta' = (\theta'_1, \ldots, \theta'_k) \in B(\theta)$, then $\theta^* = \theta_j = \theta'_j$ for $j \in J(\theta)$, and $\max_{1 \le j \le k} \theta'_j > \theta^*$. Suppose we 
choose a sequence of $\theta' \in B(\theta)$ such that there is only one component $\theta'_j$ approaching 
$\theta^*$ from above and the other components $\theta'_{j'}$, $j' \notin J(\theta)$, all have the same values as the 
corresponding components of $\theta$. Taking the infimum over this sequence of $\theta' \in B(\theta)$ in 
(3.7), we obtain (3.6). This completes the proof of (3.4) $\Rightarrow$ (3.5). 

To prove (3.5) $\Rightarrow$ (3.4), we assume that (3.4) does not hold. That is, there exists 
a $\theta' \in B(\theta)$ such that 

$\sum_{j \notin J(\theta)} I(\theta_j, \theta'_j) z_j(\theta) < 1$. 

Because $\theta' \in B(\theta)$, there exists at least one component $\theta'_{j_0}$ of $\theta'$ such that $\theta'_{j_0} > \theta^*$. 
Since every term of the sum is nonnegative, the preceding inequality and the property 
of exponential families imply that 

$z_{j_0}(\theta) I(\theta_{j_0}, \theta^*) < z_{j_0}(\theta) I(\theta_{j_0}, \theta'_{j_0}) < 1$, 

and thus (3.6) does not hold. This establishes (3.5) $\Rightarrow$ (3.4) and the proof is 
complete. □ 
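
For a numerical feel of the single-group bound, note that when the constraints (3.6) are tight, $z_j(\theta) = 1/I(\theta_j, \theta^*)$ and the lower bound constant equals $\sum_{j \notin J(\theta)} [\mu^* - \mu_j] / I(\theta_j, \theta^*)$. The sketch below evaluates this constant under the assumption, made only for this example, that the arms are independent Bernoulli populations parameterized by their means.

```python
import math


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means):
    """Sum over inferior arms of (mu* - mu_j) / I(theta_j, theta*)."""
    best = max(means)
    return sum((best - m) / kl_bernoulli(m, best) for m in means if m < best)


# Two optimal arms (0.5) and one inferior arm (0.4): only the inferior arm
# contributes, so the regret must grow at least like c * log N.
c = lai_robbins_constant([0.5, 0.4, 0.5])
```

The constant grows as an inferior arm approaches the optimal one, because the information $I(\theta_j, \theta^*)$ in the denominator vanishes faster than the gap in the numerator.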



Corollary 2. When there is only one arm in each group, (3.1) reduces to the 
lower bound (1.17) of Hu and Wei [6]. 

Proof. In Hu and Wei [6], the sets $\Theta_i$ are intervals of $\mathbb{R}$. Thus the infimum over $\Theta_i$ 
is achieved at the end points of the intervals. Furthermore, because there is only 
one arm in each group, the bad sets are all empty and therefore the last inequality 
in (3.3) is not needed. In view of these facts, it is straightforward to show that 
the system of inequalities (3.3) reduces to (1.14) of Hu and Wei [6]. The proof is 
complete. □ 



Corollary 3. When there is only one arm in each group, the lower bound (3.1) 
reduces to (3.2) of Fuh and Hu [3]. 

Proof. The assumptions A3 and A4 of Fuh and Hu [3] correspond to the regularity 
condition A1 and the positive information criterion A2 in Section 2, respectively. 
A1, A2 and A5 of Fuh and Hu [3] are essentially the same as the Harris recurrence 
condition A3, the drift condition A4, and the finite second moment condition A5 
of this paper, respectively. 

Note that the definition of bad sets in [3] is different from that of this paper. 
In [3], the bad set consists of all those parameter values whose optimal arm is not 
in the same group and which cannot be distinguished when sampling from the optimal 
arm. Here the bad set consists of parameter values that have a different optimal arm 
(but still in the same group) and cannot be distinguished when sampling from the 
optimal arm(s). If we adopt the definition (2.11), then it is clear that the bad sets 
are all empty under the setting of [3]. 

The infimum in Problem A of Fuh and Hu [3] is taken over the union of $\Theta_i$ and 
the corresponding bad set. Because the bad sets in [3] are all empty, as pointed 
out earlier, the infimum is actually taken over $\Theta_i$. With this understanding, it is 
straightforward to verify that the lower bound (3.1) reduces to (3.2) of Fuh and 
Hu [3]. □ 



4. Super efficiency 

The strategy for allocating the observations is as follows. For the rationale 
of the proposed strategy and a more detailed discussion, see [1]. Let $n_0$ and 
$n_1$ be positive integers that increase to infinity with $N$ and satisfy 
$n_0 = o(\log N)$ and $n_1 = o(n_0)$. 

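1. Estimation. Select $n_0$ observations from each arm in group 1 and let $\hat\theta$ be the 
maximum likelihood estimate (MLE) of $\theta$ defined by 

(4.1)  $L(\theta) = \sum_{j=1}^{J_1} \sum_{t=1}^{n_0} \log p_{1j}(X_{1j(t-1)}, X_{1jt}; \theta)$,  $\hat\theta = \arg\max_{\theta \in \Theta} L(\theta)$. 

Let $\hat\ell = \min\{i : N_{\delta/2}(\hat\theta) \cap \Theta_i \ne \emptyset\}$. Select an adjusted MLE $\hat\theta_a \in N_{\delta/2}(\hat\theta) \cap \Theta_{\hat\ell}$ 
(where $\delta \to 0$ as $N \to \infty$ at a rate to be specified in Theorem 2 below) in the 
following manner. Let $|\cdot|$ denote the number of elements in a finite set and 

(4.2)  $\hat J = \max\{|J(\theta')| : \theta' \in N_{\delta/2}(\hat\theta) \cap \Theta_{\hat\ell}\}$. 

We require that 

(4.3)  $\hat\theta_a \in H := \{\theta \in N_{\delta/2}(\hat\theta) \cap \Theta_{\hat\ell} : |J(\theta)| = \hat J\}$. 

The motivation behind considering an adjusted MLE is to estimate $J(\theta)$ and the 
set $\Theta_\ell$ that $\theta$ belongs to consistently. This has implications in the experimentation 
phase. We note that if $|J(\theta)| > 1$, then $J(\hat\theta)$ need not be consistent for $J(\theta)$, and if 
$\theta$ lies on $\Theta_\ell \setminus \Theta_\ell^*$ [see (2.2) and (2.4)], then $\hat\theta$ need not be consistently inside $\Theta_\ell$. 
In contrast, the probability that $J(\hat\theta_a) = J(\theta)$ and $\hat\theta_a$ lies inside $\Theta_\ell$ tends to 1 as 
$N \to \infty$. 

Let 

(4.4)  $\hat B_{\hat\ell}(\hat\theta; \delta) = \bigcup_{\theta' \in H} B_{\hat\ell}(\theta')$ 

and let $\{z_{ij}\}_{1 \le i \le \hat\ell,\, 1 \le j \le J_i}$ minimize 

(4.5)  $\sum_{i < \hat\ell} \sum_{j=1}^{J_i} [\mu^*(\hat\theta_a) - \mu_{ij}(\hat\theta_a)] z_{ij} + \sum_{j \notin J(\hat\theta_a)} [\mu^*(\hat\theta_a) - \mu_{\hat\ell j}(\hat\theta_a)] z_{\hat\ell j}$, 

subject to the constraints 

(4.6) 
$$\begin{cases}
\inf_{\theta' \in \Theta_1} \big\{ \sum_{j=1}^{J_1} I_{1j}(\hat\theta_a, \theta') z_{1j} \big\} \ge 1, \\
\inf_{\theta' \in \Theta_2} \big\{ \sum_{j=1}^{J_1} I_{1j}(\hat\theta_a, \theta') z_{1j} + \sum_{j=1}^{J_2} I_{2j}(\hat\theta_a, \theta') z_{2j} \big\} \ge 1, \\
\qquad \vdots \\
\inf_{\theta' \in \hat B_{\hat\ell}(\hat\theta; \delta)} \big\{ \sum_{i < \hat\ell} \sum_{j=1}^{J_i} I_{ij}(\hat\theta_a, \theta') z_{ij} + \sum_{j \notin J(\hat\theta_a)} I_{\hat\ell j}(\hat\theta_a, \theta') z_{\hat\ell j} \big\} \ge 1.
\end{cases}$$

Let $k = 1$. 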

2. Experimentation. If $k \le \hat\ell$, select $\lfloor z_{kj} \log N \rfloor$ observations from arm $kj$, where 
$\lfloor \cdot \rfloor$ denotes the greatest integer function. If $k > \hat\ell$, we skip the experimentation stage. 
We note that if $\hat B_{\hat\ell}(\hat\theta; \delta)$ is empty, then the last inequality in (4.6) is automatically 
satisfied and hence we can select $z_{\hat\ell 1} = \cdots = z_{\hat\ell J_{\hat\ell}} = 0$. In other words, if $\hat B_{\hat\ell}(\hat\theta; \delta)$ is 
empty, then the experimentation stage is also skipped for $k = \hat\ell$. 

3. Testing. Start with a full set $\{k1, \ldots, kJ_k\}$ of unrejected jobs. The rejection 
of a job is based on the following test statistic. Let $F_k$, $1 \le k \le I$, be a probability 
distribution with positive probability on all open subsets of $\cup_{i \ge k} \Theta_i$. Define 

(4.7)  $U_k(n; \lambda) = \dfrac{\int_{\cup_{i \ge k} \Theta_i} \prod_{i=1}^{k} \prod_{j=1}^{J_i} \nu_{ij}(X_{ij0}; \theta) \prod_{t=1}^{n} p_{ij}(X_{ij(t-1)}, X_{ijt}; \theta) \, dF_k(\theta)}{\prod_{i=1}^{k} \prod_{j=1}^{J_i} \nu_{ij}(X_{ij0}; \lambda) \prod_{t=1}^{n} p_{ij}(X_{ij(t-1)}, X_{ijt}; \lambda)}$ 

for all $\lambda \in \Theta_k$. 

(a) If $\hat\theta_a \in \cup_{i > k} \Theta_i$: Add one observation from each unrejected job. Reject 
parameter $\lambda$ if $U_k(n; \lambda) \ge N$. Reject a job $kj$ if all $\lambda \in \Theta_{kj}$ have been rejected at some 
point in the testing stage. If there is a job in group $k$ left unrejected and the total 
number of observations is less than $N$, repeat 3(a). Otherwise go to step 4. 

(b) If $\hat\theta_a \in \Theta_k$: Add $n_1$ observations from each unrejected job $kj$, $j \in J(\hat\theta_a)$, and 
one observation from each unrejected job $kj$, $j \notin J(\hat\theta_a)$. Reject a job $kj$ if all $\lambda \in \Theta_{kj}$ 
have been rejected at some point in the testing phase. If there is a job in group $k$ 
left unrejected and the total number of observations is less than $N$, repeat 3(b). 
Otherwise, go to step 4. 

(c) If $\hat\theta_a \in \cup_{i < k} \Theta_i$: Adopt the procedure of 3(a). 

4. Moving to the next group and termination. The strategy terminates once $N$ 
observations have been collected. Otherwise, if $k < I$, increment $k$ by 1 and go to 
step 2; if $k = I$, select all remaining observations from a job $Ij$ satisfying 
$\mu_{Ij}(\hat\theta_a) = \max_{1 \le h \le J_I} \mu_{Ih}(\hat\theta_a)$. 
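
The selection of the adjusted MLE in step 1, see (4.2)-(4.3), can be sketched on a finite grid of candidate parameter values. Everything specific in the sketch is an assumption made for illustration: the parameter is taken to be the vector of arm means, `GROUPS` fixes a hypothetical layout with two arms in group 1 and one arm in group 2 (indexed from 0 in the code), and the grid stands in for $\Theta$.

```python
import math

# Hypothetical layout: coordinates of theta belonging to each group.
GROUPS = [(0, 1), (2,)]


def group_means(theta):
    return [[theta[c] for c in coords] for coords in GROUPS]


def first_optimal_group(theta):
    """0-based index of the earliest group attaining the overall maximum."""
    best = [max(row) for row in group_means(theta)]
    return best.index(max(best))


def optimal_arms(theta):
    """J(theta): optimal arms inside the first optimal group."""
    row = group_means(theta)[first_optimal_group(theta)]
    return {j for j, m in enumerate(row) if m == max(row)}


def adjusted_mle_candidates(theta_hat, grid, delta):
    """Return (ell_hat, H) as in (4.2)-(4.3), restricted to a finite grid.

    The grid must contain at least one point within delta/2 of theta_hat.
    """
    near = [t for t in grid if math.dist(t, theta_hat) < delta / 2]
    ell_hat = min(first_optimal_group(t) for t in near)
    in_group = [t for t in near if first_optimal_group(t) == ell_hat]
    j_hat = max(len(optimal_arms(t)) for t in in_group)
    return ell_hat, [t for t in in_group if len(optimal_arms(t)) == j_hat]


theta_hat = (0.52, 0.5, 0.3)  # raw MLE: arm 0 looks slightly better
grid = [(0.52, 0.5, 0.3), (0.5, 0.5, 0.3), (0.5, 0.5, 0.5)]
ell_hat, H = adjusted_mle_candidates(theta_hat, grid, delta=0.1)
```

In this example the raw estimate suggests a single optimal arm, whereas the adjusted estimate is drawn from $H = \{(0.5, 0.5, 0.3)\}$, which keeps both arms of group 1 as candidates for $J(\theta)$. This is the tie situation in which $J(\hat\theta)$ need not be consistent.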

In [1], Theorem 2, it was established that when $B_\ell(\theta)$ is non-empty, the 
asymptotic lower bound of the regret is attained with the procedure above. We 
shall show that the same procedure is not only asymptotically optimal but also that 
the regret from the optimal group is $o(\log N)$ when $B_\ell(\theta) = \emptyset$, as opposed to 
$O(\log N)$ when $B_\ell(\theta) \ne \emptyset$. A key step required in our proof is the 
consistency result 

(4.8)  $P_\theta\{\hat B_{\hat\ell}(\hat\theta; \delta) = \emptyset\} \to 1$ as $N \to \infty$ 

under the empty bad set assumption. 
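Theorem 2. Let $\theta \in \Theta_\ell$. Assume A1-A5 and (1.5). Let $n_0 \to \infty$ with $n_0 = 
o(\log N)$ and $n_1 \to \infty$ such that $n_1 = o(n_0)$. There exists $\delta (= \delta_N) \downarrow 0$ as $N \to \infty$ 
such that 

(4.9)  $P_\theta\{\hat\theta \in \Theta \setminus N_\delta(\theta)\} = o(n_1^{-1})$ as $n_1 \to \infty$. 

Moreover, if $B_\ell(\theta) = \emptyset$, then (4.8) holds and 

(4.10)  $\sum_{j \notin J(\theta)} E_\theta T_N(\ell j) = o(\log N)$. 

Hence 

(4.11)  $\lim_{N \to \infty} R_N(\theta) / \log N = z(\theta, \ell)$. 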

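Proof. The consistency of $\hat\theta$ in (4.9) follows from A2 and (4.5) of Chan, Fuh and 
Hu [1]. We shall now prove (4.8). Since $\delta \downarrow 0$ and $\hat\theta$ is consistent for $\theta$, it suffices, 
from the definition of $\hat B_{\hat\ell}(\hat\theta; \delta)$ in (4.4), to show that there exists $\delta_0 > 0$ such that 

(4.12)  $B_\ell(\tilde\theta) = \emptyset$ for all $\tilde\theta \in N_{\delta_0}(\theta) \cap \Theta_\ell$ with $|J(\tilde\theta)| = |J(\theta)|$. 

We observe from the continuity of $\mu_{\ell j}$ that there exists $\delta_1 > 0$ such that $J(\tilde\theta) \subset J(\theta)$ 
for all $\tilde\theta \in N_{\delta_1}(\theta) \cap \Theta_\ell$. Hence if $|J(\tilde\theta)| = |J(\theta)|$, then it must be 
true that $J(\tilde\theta) = J(\theta)$. We see from the definition of bad sets in (2.11) that for each 
$\theta' \in \Theta_\ell \setminus (\cup_{j \in J(\theta)} \Theta_{\ell j})$, $I_{\ell j}(\theta, \theta') > 0$ for some $j \in J(\theta)$, and hence by the continuity 
of the Kullback-Leibler information, there exists $\delta_2 > 0$ such that $I_{\ell j}(\tilde\theta, \theta') > 0$ 
whenever $\tilde\theta \in N_{\delta_2}(\theta)$. Select $\delta_0 = \min\{\delta_1, \delta_2\}$. Then (4.12) holds. 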

We shall next show (4.10). By (4.8) and since the experimentation stage is 
skipped when $k = \hat\ell$ and $\hat B_{\hat\ell}(\hat\theta; \delta) = \emptyset$, it suffices to show that the expected total 
number of observations taken from inferior arms in the testing stage is $o(\log N)$. 
Define $p_N = P_\theta\{J(\hat\theta_a) = J(\theta)\}$. Then by (4.9), and as $J(\tilde\theta) \subset J(\theta)$ for all 
$\tilde\theta \in N_{\delta_1}(\theta)$ for some $\delta_1 > 0$, $1 - p_N = o(n_1^{-1})$. By (2.16) and the assumption 
$B_\ell(\theta) = \emptyset$, at least one optimal arm will provide positive information against each 
$\theta' \notin \cup_{j \in J(\theta)} \Theta_{\ell j}$. By A3-A5 and (6.4), (6.5) of Chan, Fuh and Hu [1], an expected 
$O(\log N)$ number of observations from arms with positive information is required 
to reject each $\theta' \in \Theta_\ell \setminus (\cup_{j \in J(\theta)} \Theta_{\ell j})$. Hence $O(n_1^{-1} \log N)$ recursions are 
involved when $J(\theta) = J(\hat\theta_a)$, because at least $n_1$ observations in each recursion have 
positive information. Similarly, $O(\log N)$ recursions are needed when $J(\theta) \ne J(\hat\theta_a)$, 
because at least one observation in each recursion has positive information. The 
number of observations from inferior arms in each recursion is $O(1)$ if $J(\hat\theta_a) = J(\theta)$ 
and $O(n_1)$ otherwise. Hence the expected number of observations from inferior arms 
during the recursion steps in the testing phase is 

(4.13)  $p_N O(n_1^{-1} \log N) + (1 - p_N) O(n_1 \log N) = o(\log N)$. 

The asymptotic result (4.11) follows from (4.10) and the proof of Theorem 2 of Chan, 
Fuh and Hu [1]. □ 

For the special case $\ell = 1$, it follows from (4.11) that $R_N(\theta) = o(\log N)$. In 
[2] and [10], a uniformly good procedure was proposed that satisfies $R_N(\theta) = O(1)$ 
when $\Theta$ is finite and $I = 1$. 
5. The switching cost 

Let $a(\theta) > 0$ be the switching cost between two arms that are not both optimal 
when the underlying parameter is $\theta$. It is assumed here that there is no switching 
cost when both arms are optimal. Then 

$L_N(\theta) := a(\theta) E_\theta \Big( \sum_{t=1}^{N-1} \mathbf{1}\{\phi_t \ne \phi_{t+1},\ \min[\mu_{\phi_t}(\theta), \mu_{\phi_{t+1}}(\theta)] < \mu^*(\theta)\} \Big)$ 

is the average switching cost of a procedure. It is desirable that this cost be 
asymptotically negligible compared to the regret as $N \to \infty$. 

Theorem 3. Under Assumptions A1-A5, the strategy $\phi^*$ has average switching 
cost 

(5.1)  $L_N(\theta) = o(\log N)$ as $N \to \infty$. 

Hence, the strategy is asymptotically optimal when there is a switching cost. 

Proof. In the estimation stage it is required to take $n_0$ observations from each arm 
in group 1. We can take the $n_0$ observations in batches and switch only $J_1 - 1$ 
times. Therefore the switching cost from the estimation stage is $a(\theta)(J_1 - 1)$. In the 
experimentation stage, we need to allocate at most $z_{kj} \log N$ observations to arm $kj$. 
Again this can be done in batches and thus the switching cost from the experimentation 
stage is at most $a(\theta)(J_k - 1)$. In the testing stage, it is shown in (6.12) of Chan, Fuh 
and Hu [1] that the expected total number of observations is $o(\log N)$ and thus 
the switching cost is no more than $o(\log N)$. Adding the switching costs from the 
estimation, experimentation, and testing stages together shows that the total cost 
due to switching is $o(\log N)$. However, the regret lower bound is $O(\log N)$, which 
implies that the switching cost constitutes a negligible part of the total regret as 
$N \to \infty$. This completes the proof that the proposed strategy is still asymptotically 
optimal with constant cost per switch. □ 
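
The batching argument in the proof can be checked on a small example. The sketch below counts all switches in a sequence of pulls, which is an upper bound on the switches charged in $L_N(\theta)$ since switches between two optimal arms are free. The arm labels and the helper names are made up for this illustration.

```python
def count_switches(pulls):
    """Number of times consecutive pulls use different arms."""
    return sum(a != b for a, b in zip(pulls, pulls[1:]))


def batched(arms, n0):
    """Take all n0 observations from one arm before moving to the next."""
    return [arm for arm in arms for _ in range(n0)]


def round_robin(arms, n0):
    """Cycle through the arms n0 times, one observation at a time."""
    return [arm for _ in range(n0) for arm in arms]


group_1 = ["11", "12", "13"]  # J_1 = 3 arms in group 1
n0 = 4
s_batched = count_switches(batched(group_1, n0))          # J_1 - 1 switches
s_round_robin = count_switches(round_robin(group_1, n0))  # J_1 * n0 - 1 switches
```

Both schedules collect the same $n_0$ observations per arm, but only the batched one keeps the number of switches bounded as $n_0$ grows.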



6. Extension of Wald's equation to Markovian rewards 

As we will be focusing on a single arm $ij$ and fixed parameters $\theta_0$, $\theta_q$ such that 
$\mu := I_{ij}(\theta_0, \theta_q) > 0$, we will drop some of the references to $i$, $j$, $\theta_0$, $\theta_q$ and $q$ in 
this section. This applies also to the notation in assumptions A3-A5. Moreover, we 
shall use the notation $E(\cdot)$ as a short form of $E_{\theta_0}(\cdot)$ and $E_x(\cdot)$ as a short form of 
$E_{\theta_0}(\cdot \mid X_0 = x)$. Let $S_n = \xi_1 + \cdots + \xi_n$, where 
$\xi_k = \log[p_{ij}(X_{k-1}, X_k; \theta_0) / p_{ij}(X_{k-1}, X_k; \theta_q)]$ has stationary mean $\mu$ under $P_{\theta_0}$, 
and let $\tau$ be a stopping-time. We shall show that 

(6.1)  $E S_\tau = \mu(E\tau) - E[\gamma(X_\tau)] + E[\gamma(X_0)]$ 

for some function $\gamma$ to be specified in Lemma 1. In Lemma 2, we show that the 
conditions on $V$ in A4-A5 lead to bounds on $\gamma(x)$, and by applying Lemma 3, we 
obtain 

(6.2)  $E|\gamma(X_\tau)| + E|\gamma(X_0)| = o(E\tau)$. 

Substituting (6.2) back into (6.1), Wald's equation 

(6.3)  $E S_\tau = [\mu + o(1)] E\tau$ 

is established for Markovian rewards. Under a uniform recurrence condition, Fuh and 
Lai [4] established Wald's equation based on perturbation theory for the transition 
operator. In [5], Wald's equation was proved under the assumption that a solution 
of the Poisson equation for the transition operator exists. In this section, we apply 
the idea of regeneration epochs to derive Wald's equation for Markov random walks. 

By (2.5), we can augment the Markov additive process and create a split chain 
containing an atom, so that increments in $S_n$ between visits to the atom are 
independent. More specifically, we construct stopping-times $0 < \kappa(1) < \kappa(2) < \cdots$ 
using an auxiliary randomization procedure such that 

(6.4)  $P\{X_{n+1} \in A,\ \kappa(i) = n + 1 \mid X_n = x,\ \kappa(i) > n \ge \kappa(i-1)\} = \begin{cases} \alpha \varphi(A) & \text{if } x \in G, \\ 0 & \text{otherwise.} \end{cases}$ 

Then by Lemma 3.1 of Ney and Nummelin [9], 

(i) $\{\kappa(i+1) - \kappa(i) : i = 1, 2, \ldots\}$ are i.i.d. random variables, 

(ii) the random blocks $\{\xi_{\kappa(i)}, \ldots, \xi_{\kappa(i+1)-1}\}$, $i = 1, 2, \ldots$, are independent, and 

(iii) $P\{X_{\kappa(i)} \in A \mid \mathcal{F}_{\kappa(i)-1}\} = \varphi(A)$, where $\mathcal{F}_n$ is the $\sigma$-field generated by $\{X_0, \ldots, X_n\}$. 

By (ii)-(iii), $E_\varphi(S_\kappa - \kappa\mu) = 0$. Define $\kappa = \kappa(1)$. We shall use the notation 
"$n$ = atom" to denote $n = \kappa(i)$ for some $i$. 

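Lemma 1. Let $\gamma(x) = E_x(S_\kappa - \kappa\mu)$. Then $Z_n = (S_n - n\mu) + \gamma(X_n)$ is a martingale 
with respect to $\mathcal{F}_n$. Hence (6.1) holds. 

Proof. We can express 

$Z_n = E(S_{U_n} - U_n \mu \mid \mathcal{F}_n)$ where $U_n = \inf\{m > n : m = \text{atom}\}$. 

If $X_n = x_n \notin G$, then by (6.4), $U_n > n + 1$. Hence $U_{n+1} = U_n$ and 

(6.5)  $E(Z_{n+1} \mid \mathcal{F}_n) = Z_n$ 

because $\mathcal{F}_{n+1} \supset \mathcal{F}_n$. If $X_n = x_n \in G$, then by (6.4) and (ii), 

$E(Z_{n+1} \mid \mathcal{F}_n) - Z_n = E[(S_{U_{n+1}} - S_{U_n}) - (U_{n+1} - U_n)\mu \mid \mathcal{F}_n] = \alpha E_\varphi(S_\kappa - \kappa\mu) = 0$ 

and hence (6.5) also holds. □ 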
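
The identity (6.1) can be verified exactly in a toy setting. The sketch below assumes a two-state chain with transition matrix $P = [[1-a, a], [b, 1-b]]$, increments $\xi_k = f(X_{k-1}, X_k)$ for an arbitrary table $f$, and the deterministic stopping time $\tau = n$. Instead of the regeneration-based definition in Lemma 1, it takes $\gamma$ to be the solution of the Poisson equation $\gamma - P\gamma = h - \mu$ normalized by $\gamma(0) = 0$, where $h(x) = \sum_y P(x, y) f(x, y)$. Any such solution makes $Z_n$ a martingale, and an additive constant in $\gamma$ cancels in (6.1). All names and numbers are illustrative.

```python
def wald_check(a, b, f, n, start):
    """Return (E_x S_n, mu*n - E_x gamma(X_n) + gamma(x)) for a two-state chain."""
    P = [[1 - a, a], [b, 1 - b]]
    # One-step conditional mean increment h(x) = E[xi_1 | X_0 = x].
    h = [sum(P[x][y] * f[x][y] for y in range(2)) for x in range(2)]
    pi = [b / (a + b), a / (a + b)]          # stationary distribution
    mu = pi[0] * h[0] + pi[1] * h[1]         # stationary mean increment
    # Poisson equation gamma - P gamma = h - mu with gamma(0) = 0.
    gamma = [0.0, (h[1] - mu) / b]
    dist = [1.0 if x == start else 0.0 for x in range(2)]
    expected_sum = 0.0
    for _ in range(n):
        expected_sum += dist[0] * h[0] + dist[1] * h[1]
        dist = [dist[0] * P[0][y] + dist[1] * P[1][y] for y in range(2)]
    rhs = mu * n - (dist[0] * gamma[0] + dist[1] * gamma[1]) + gamma[start]
    return expected_sum, rhs


f = [[1.0, -2.0], [0.5, 3.0]]  # increment table f(x, y), illustrative values
lhs, rhs = wald_check(a=0.1, b=0.2, f=f, n=25, start=0)
```

The two returned numbers agree up to rounding, and the correction term $-E\gamma(X_\tau) + E\gamma(X_0)$ stays bounded in $n$, so $E S_n / n \to \mu$, in line with (6.3).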
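Lemma 2. Under A3-A5, 

$|\gamma(x)| \le \beta^{-1} [V(x) + b + (V^* + b) V^* (\alpha^{-1} + 1)] (K + 1 + |\mu|)$, 

where $\alpha$ satisfies (2.5), $V^*$ is defined in A4 and $K$ is defined in (2.9). 

Proof. By (2.9), 

(6.6)  $V(x) \ge K^{-1} E_x \xi_1^2 \ge K^{-1} (E_x|\xi_1| - 1)$. 

Let $0 < \sigma(1) < \sigma(2) < \cdots$ be the hitting times of the set $G$ and let $\sigma = \sigma(1)$. Let 

(6.7)  $m_x(A) = E_x \Big[ \sum_{n=1}^{\kappa} V(X_n) \mathbf{1}\{X_n \in A\} \Big]$ 

for all measurable sets $A \subset D$. By (2.7), 

$E_x[V(X_n) \mathbf{1}\{\sigma \ge n\}] \le (1 - \beta) E_x[V(X_{n-1}) \mathbf{1}\{\sigma \ge n - 1\}]$, $n \ge 2$, 

and 

$E_x[V(X_1)] \le V(x) + b$. 

Hence by induction, 

(6.8)  $E_x \Big[ \sum_{n=1}^{\sigma} V(X_n) \Big] \le [V(x) + b] \sum_{n=1}^{\infty} (1 - \beta)^{n-1} = [V(x) + b] / \beta$. 

By (6.7) and (6.8), and as $V \ge 1$, 

$m_x(D) \le E_x \Big\{ \sum_{n=1}^{\sigma} V(X_n) + \sum_{k=1}^{\infty} \sum_{n=\sigma(k)+1}^{\sigma(k+1)} V(X_n) \mathbf{1}\{\kappa > \sigma(k)\} \Big\}$ 

$\quad = E_x \Big\{ \sum_{n=1}^{\sigma} V(X_n) + \sum_{k=1}^{\infty} E_{X_{\sigma(k)}} \Big[ \sum_{n=1}^{\sigma} V(X_n) \Big] \mathbf{1}\{\kappa > \sigma(k)\} \Big\}$ 

$\quad \le \beta^{-1} [V(x) + b] + E_x \Big\{ \sum_{k=1}^{\infty} \beta^{-1} [V(X_{\sigma(k)}) + b] \mathbf{1}\{\kappa > \sigma(k)\} \Big\}$ 

(6.9)  $\quad \le \beta^{-1} [V(x) + b] + \beta^{-1} (V^* + b) m_x(G)$. 

But by (6.4), $m_x(G) \le V^* (\alpha^{-1} + 1)$. Since $|\gamma(x)| \le (K + 1 + |\mu|) m_x(D)$, Lemma 2 
holds. □ 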

Let $W_i = |\gamma(X_{\kappa(i)})| + \cdots + |\gamma(X_{\kappa(i+1)-1})|$, for $i \ge 1$. Then by A3-A5, Lemma 
2 and its proof, and (i)-(iii), $W_1, W_2, \ldots$ are i.i.d. with finite mean, while by (2.8), 
$W_0 := |\gamma(X_0)| + \cdots + |\gamma(X_{\kappa(1)-1})|$ also has finite mean. 

Lemma 3. Let $M_n = \max_{1 \le k \le n} W_k$. Then for any stopping-time $\tau$, 

(6.10)  $E(M_\tau) = o(E\tau)$. 

Proof. Let $\delta > 0$ and let $c (= c_\delta) > 0$ be large enough such that $E[(W_1 - c)^+] \le \delta$. 
We shall show that 

$Z_n = (M_n \vee c) - n\delta$ 

is a supermartingale. Indeed, for any $\lambda \ge 0$, 

$E[M_{n+1} \vee c \mid M_n \vee c = c + \lambda] = c + \lambda + E[(W_{n+1} - c - \lambda)^+] \le c + \lambda + \delta$ 

and the claim is shown. Hence $E Z_\tau \le E Z_0 = c$ and it follows that $E(M_\tau) \le 
E(M_\tau \vee c) \le \delta(E\tau) + c$. Lemma 3 then follows by letting $\delta \downarrow 0$. □ 

7. Appendix 

Proof of (1.3). Let $X_{ijt}$ denote the $t$-th observation taken from arm $ij$. Then 

(7.1)  $\Big| W_N(\theta) - \sum_{i=1}^{I} \sum_{j=1}^{J_i} \mu_{ij}(\theta) E_\theta T_N(ij) \Big| \le \sum_{i=1}^{I} \sum_{j=1}^{J_i} \sum_{t=1}^{\infty} |E_\theta g(X_{ijt}) - \mu_{ij}(\theta)|$. 

For any signed measure $\lambda$ on $(D, \mathcal{D})$, let 

(7.2)  $\|\lambda\|_{V_{ij}} = \sup_{h : |h| \le V_{ij}} \Big| \int h(x) \lambda(dx) \Big|$. 

It follows from Meyn and Tweedie ([8], p. 367 and Theorem 16.0.1) that under A3 
and the geometric drift condition (2.7), 

(7.3)  $\omega_{ij} := \sup_{\theta \in \Theta,\, x \in D} \sum_{t=1}^{\infty} \|P^\theta_{ijt}(x, \cdot) - \pi_{ij}(\theta)\|_{V_{ij}} / V_{ij}(x) < \infty$, 

where $P^\theta_{ijt}(x, \cdot)$ denotes the distribution of $X_{ijt}$ conditioned on $X_{ij0} = x$ and $\pi_{ij}(\theta)$ 
denotes the stationary distribution of $X_{ijt}$ under parameter $\theta$. By (2.6), there exists 
$K > 0$ such that $K|g(x)| \le V_{ij}(x)$ for all $x \in D$ and hence it follows from (7.2) and 
(7.3) that 

(7.4)  $\sum_{t=1}^{\infty} |E_{\theta,x} g(X_{ijt}) - \mu_{ij}(\theta)| \le K^{-1} \omega_{ij} V_{ij}(x)$, 

where $E_{\theta,x}$ denotes expectation with respect to $P_\theta$ and initial state $X_{ij0} = x$. 

In general, for any initial distribution $\nu_{ij}(\cdot; \theta)$, it follows from (2.8) and (7.4) 
that 

$\sum_{t=1}^{\infty} |E_\theta g(X_{ijt}) - \mu_{ij}(\theta)| \le \int \sum_{t=1}^{\infty} |E_{\theta,x} g(X_{ijt}) - \mu_{ij}(\theta)| \nu_{ij}(x; \theta) Q(dx) < \infty$ 

uniformly over $\theta \in \Theta$ and hence (1.3) follows from (7.1). □ 
References 

[1] Chan, H. P., Fuh, C. D. and Hu, I. (2005). Optimal strategies for a class 
of sequential control problems with precedence relations. Annals of Statistics, 
to appear. 

[2] Feldman, D. (1962). Contributions to the "two-armed bandit" problem. 
Annals of Mathematical Statistics 33 847-856. MR0145625 

[3] Fuh, C. D. and Hu, I. (2000). Asymptotically efficient strategies for a 
stochastic scheduling problem with order constraints. Annals of Statistics 28 
1670-1695. MR1835036 

[4] Fuh, C. D. and Lai, T. L. (1998). Wald's equations, first passage times 
and moments of ladder variables in Markov random walks. Journal of Applied 
Probability 35 566-580. MR1659504 

[5] Fuh, C. D. and Zhang, C. H. (2000). Poisson equation, moment inequalities 
and r-quick convergence for Markov random walks. Stochastic Processes and 
their Applications 87 53-67. MR1751164 

[6] Hu, I. and Wei, C. Z. (1989). Irreversible adaptive allocation rules. Annals 
of Statistics 17 801-823. MR0994269 

[7] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive 
allocation rules. Advances in Applied Mathematics 6 4-22. MR0776826 

[8] Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic 
Stability. Springer-Verlag, New York. MR1287609 

[9] Ney, P. and Nummelin, E. (1987). Markov additive processes I: eigenvalue 
properties and limit theorems. Annals of Probability 15 561-592. MR0885131 

[10] Rodman, L. (1978). On the many-armed bandit problem. Annals of 
Probability 6 491-498. MR0494728 



